
Comments

Parallel manager2 #1

Open
lolyu wants to merge 43 commits into master from parallel_manager2

Conversation

@lolyu
Owner

@lolyu lolyu commented Feb 2, 2026

Description of PR

Summary:
Fixes # (issue)

Type of change

  • Bug fix
  • Testbed and Framework(new/improvement)
  • New Test case
    • Skipped for non-supported platforms
  • Test case improvement

Back port request

  • 202205
  • 202305
  • 202311
  • 202405
  • 202411
  • 202505
  • 202511

Approach

What is the motivation for this PR?

How did you do it?

How did you verify/test it?

Any platform specific information?

Supported testbed topology if it's a new test case?

Documentation

@lolyu lolyu closed this Feb 2, 2026
@lolyu lolyu reopened this Feb 2, 2026
@lolyu
Owner Author

lolyu commented Feb 2, 2026

@greptile

@lolyu lolyu closed this Feb 2, 2026
@lolyu lolyu reopened this Feb 2, 2026
@greptile-apps

greptile-apps bot commented Feb 2, 2026

Greptile Overview

Greptile Summary

This PR introduces a parallel fixture execution framework to optimize test execution time by parallelizing fixture setup/teardown operations. The implementation includes:

  • Parallel Fixture Manager: A new pytest plugin (parallel_fixture) that uses ThreadPoolExecutor to execute fixture setup/teardown tasks concurrently across different scopes (SESSION, MODULE, CLASS, FUNCTION)
  • Scope-based barriers: Automatic synchronization points that ensure all parallel tasks complete before proceeding to the next test scope
  • Deadlock prevention: Fork-safety mechanisms using os.register_at_fork to prevent logging and Ansible display deadlocks in multi-threaded/multi-process environments
  • Closure bug fixes: Corrected several __getattr__ implementations in base.py, duthosts.py, and multi_asic.py where module names were incorrectly captured by reference instead of by value
  • Docker capability: Added SYS_PTRACE capability to enable thread interruption for timeout handling
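The submit-and-barrier flow described above can be sketched as a toy model. The class and method names below (MiniParallelManager, submit_setup_task, wait_for_setup_tasks) mirror the summary but are illustrative, not the actual plugin code:

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor, wait

class MiniParallelManager:
    """Toy model: fixtures submit work per scope, a barrier joins each scope."""
    def __init__(self, workers=4):
        self.executor = ThreadPoolExecutor(max_workers=workers)
        self.futures = defaultdict(list)          # scope -> [Future, ...]

    def submit_setup_task(self, scope, fn, *args):
        # A fixture submits its setup and yields immediately instead of blocking.
        self.futures[scope].append(self.executor.submit(fn, *args))

    def wait_for_setup_tasks(self, scope):
        # Scope barrier: block until every task in this scope finishes;
        # Future.result() re-raises any setup exception for pytest to see.
        done, _ = wait(self.futures.pop(scope, []))
        return [f.result() for f in done]

mgr = MiniParallelManager()
for dut in ("dut-1", "dut-2", "dut-3"):
    mgr.submit_setup_task("SESSION", lambda name=dut: f"{name} ready")
print(sorted(mgr.wait_for_setup_tasks("SESSION")))
# ['dut-1 ready', 'dut-2 ready', 'dut-3 ready']
```

The key property is that fixture bodies return immediately after submitting, so independent setups for different DUTs overlap in time; the barrier fixture is the only place that blocks.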

The architecture is well-documented and addresses real concurrency challenges in the SONiC test infrastructure. However, there are several logical issues that need attention:

  • Missing return value check in PyThreadState_SetAsyncExc could cause silent failures when interrupting threads
  • Exception re-raising loses original traceback context, making debugging harder
  • The reset() method doesn't clear the terminated flag, potentially preventing the manager from being reused
  • Potential AttributeError when checking handler.lock that could be None

Confidence Score: 3/5

  • This PR has logical issues that could cause runtime failures in edge cases but is generally sound in design
  • The PR implements a complex parallel execution framework with good architecture and documentation. However, several critical logical issues were found: (1) missing return value checks for thread injection API that could silently fail, (2) exception handling that loses traceback context, (3) the reset mechanism doesn't clear the terminated flag which breaks reusability, and (4) potential None reference on handler.lock. These issues should be addressed before merging to ensure reliability in production.
  • tests/common/plugins/parallel_fixture/__init__.py and tests/common/helpers/parallel.py need attention for the logical issues mentioned

Important Files Changed

Filename Overview
tests/common/devices/base.py Fixed closure bug where module_name was captured incorrectly by returning a wrapper function
tests/common/devices/duthosts.py Fixed closure bug and added explicit initialization of _nodes attributes to prevent __getattr__ recursion
tests/common/devices/multi_asic.py Fixed closure bug where multi_asic_attr was captured incorrectly by returning a wrapper function
tests/common/helpers/parallel.py Added deadlock prevention for fork operations with logging locks, patched Ansible worker processes
tests/common/plugins/parallel_fixture/__init__.py Implemented parallel fixture execution framework with thread pool, scope-based barriers, and exception handling
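The closure bug fixed in base.py, duthosts.py, and multi_asic.py is Python's classic late-binding pitfall; a minimal illustration with generic names (not the actual device code):

```python
# Buggy: the lambda looks up `name` when it is CALLED, so every accessor
# ends up seeing the last value of the loop variable (capture by reference).
def make_accessors_buggy(names):
    return {name: (lambda: name) for name in names}

# Fixed: bind the current value with a default argument (capture by value),
# the same idea as returning a wrapper function from __getattr__.
def make_accessors_fixed(names):
    return {name: (lambda name=name: name) for name in names}

buggy = make_accessors_buggy(["eth0", "eth1"])
fixed = make_accessors_fixed(["eth0", "eth1"])
print(buggy["eth0"](), fixed["eth0"]())  # eth1 eth0
```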

Sequence Diagram

```mermaid
sequenceDiagram
    participant Pytest
    participant ParallelManager
    participant ThreadPool
    participant MonitorThread
    participant Fixture
    participant Barrier

    Note over Pytest,Barrier: Session Setup Phase
    Pytest->>ParallelManager: Create (session scope)
    ParallelManager->>ThreadPool: Initialize worker threads
    ParallelManager->>MonitorThread: Start monitoring

    Pytest->>Fixture: Execute fixture (session scope)
    Fixture->>ParallelManager: submit_setup_task(SESSION, func)
    ParallelManager->>ThreadPool: Submit task to thread pool
    ThreadPool-->>ParallelManager: Return future
    Fixture-->>Pytest: Yield immediately

    Pytest->>Barrier: setup_barrier_session
    Barrier->>ParallelManager: wait_for_setup_tasks(SESSION)
    ParallelManager->>ThreadPool: Wait for all session tasks
    ThreadPool-->>ParallelManager: Tasks complete

    Note over Pytest,Barrier: Module/Class/Function Scopes
    Pytest->>Fixture: Execute fixture (module scope)
    Fixture->>ParallelManager: submit_setup_task(MODULE, func)
    ParallelManager->>ThreadPool: Submit to pool
    Fixture-->>Pytest: Yield

    Pytest->>Barrier: setup_barrier_module
    Barrier->>ParallelManager: wait_for_setup_tasks(MODULE)
    ParallelManager->>ThreadPool: Wait for module tasks

    Note over Pytest,Barrier: Test Execution
    Pytest->>ParallelManager: pytest_runtest_call hook
    ParallelManager->>ThreadPool: Ensure all tasks complete
    ParallelManager->>ThreadPool: Terminate executor
    Pytest->>Pytest: Run test function

    Note over Pytest,Barrier: Teardown Phase
    Pytest->>ParallelManager: pytest_runtest_teardown hook
    ParallelManager->>ThreadPool: Reset and create new executor
    ParallelManager->>MonitorThread: Restart monitoring

    Pytest->>Fixture: Teardown fixture
    Fixture->>ParallelManager: submit_teardown_task(scope, func)
    ParallelManager->>ThreadPool: Submit teardown task
    Fixture-->>Pytest: Return

    Pytest->>Barrier: teardown_barrier_function
    Barrier->>ParallelManager: wait_for_teardown_tasks(FUNCTION)
    ParallelManager->>ThreadPool: Wait for function teardowns

    Pytest->>ParallelManager: pytest_runtest_logreport hook
    ParallelManager->>ThreadPool: Terminate executor
    ParallelManager->>MonitorThread: Stop monitoring
```


@greptile-apps greptile-apps bot left a comment


5 files reviewed, 5 comments


Comment on lines +38 to +44
```python
def raise_async_exception(tid, exc_type):
    """Injects an exception into the specified thread."""
    if not isinstance(tid, int):
        raise TypeError("Thread ID must be an integer")

    ctypes.pythonapi.PyThreadState_SetAsyncExc(ctypes.c_long(tid),
                                               ctypes.py_object(exc_type))
```

The PyThreadState_SetAsyncExc call doesn't check the return value. According to Python docs, it returns the number of thread states modified. If it returns 0, the thread ID is invalid. If it returns >1, multiple thread states were affected (shouldn't happen but indicates a serious issue).

Suggested change (tests/common/plugins/parallel_fixture/__init__.py, lines 38-44):

```python
def raise_async_exception(tid, exc_type):
    """Injects an exception into the specified thread."""
    if not isinstance(tid, int):
        raise TypeError("Thread ID must be an integer")

    ret = ctypes.pythonapi.PyThreadState_SetAsyncExc(ctypes.c_long(tid),
                                                     ctypes.py_object(exc_type))
    if ret == 0:
        logging.warning("Thread ID %s not found when injecting exception", tid)
    elif ret > 1:
        # Revert if we affected multiple threads
        ctypes.pythonapi.PyThreadState_SetAsyncExc(ctypes.c_long(tid), None)
        raise SystemError(f"PyThreadState_SetAsyncExc affected {ret} threads")
```

```python
try:
    return func(*args, **kwargs)
except Exception:
    raise sys.exc_info()[0](traceback.format_exc())
```

The exception re-raising logic uses deprecated sys.exc_info()[0] and loses the original exception chain. This makes debugging harder and may not properly preserve the exception context.

Suggested change (tests/common/plugins/parallel_fixture/__init__.py, line 266):

```python
raise
```

Comment on lines +393 to +394
```python
self.active_futures.clear()
self.done_futures.clear()
```

Clearing the futures before termination check creates a race condition. If terminate() hasn't been called yet, clearing these sets will lose track of active tasks. The terminated flag should be checked before clearing.

Suggested change (tests/common/plugins/parallel_fixture/__init__.py, lines 393-394):

```python
if not self.terminated:
    raise RuntimeError("Cannot reset a running parallel fixture manager.")

self.active_futures.clear()
self.done_futures.clear()
```

Comment on lines 388 to 398
```python
def reset(self):
    """Reset the parallel fixture manager."""
    if not self.terminated:
        raise RuntimeError("Cannot reset a running parallel fixture manager.")

    self.active_futures.clear()
    self.done_futures.clear()
    self.executor = ThreadPoolExecutor(max_workers=self.worker_count)
    self.is_monitor_running = True
    self.monitor_thread = threading.Thread(target=self._monitor_workers, daemon=True)
    self.monitor_thread.start()
```

reset() doesn't set self.terminated = False after resetting the executor. This means once terminated, the manager can never be used again because terminate() will immediately return at line 335 due to the terminated check. Either reset should set terminated = False, or the design should clarify if reset is only for internal reuse.


@lolyu lolyu force-pushed the parallel_manager2 branch 7 times, most recently from 8ab4754 to e1bc7a9 on February 5, 2026 06:09
hdwhdw and others added 16 commits February 6, 2026 08:43
Replace shell-based openssl certificate generation with Python
cryptography library. Certificates are now backdated by 1 day
to handle clock skew between test host, DUT, and PTF container.

This fixes the error:
"tls: failed to verify certificate: x509: certificate has expired
or is not yet valid: current time ... is before ..."

Changes:
- Add tests/common/cert_utils.py with TlsCertificateGenerator class
- Add create_gnmi_cert_generator() factory for gNMI/gNOI naming
- Update grpc_fixtures.py to use Python-based cert generation
- Remove localhost fixture dependency from setup_gnoi_tls_server

Signed-off-by: Dawei Huang <daweihuang@microsoft.com>
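The effect of the 1-day backdating can be illustrated with a small validity-window check. This is a pure-datetime sketch of the clock-skew reasoning; the actual change builds real certificates with the Python cryptography library:

```python
from datetime import datetime, timedelta, timezone

BACKDATE = timedelta(days=1)  # the 1-day backdating from the commit message

def cert_validity_window(issue_time, lifetime=timedelta(days=365)):
    """Backdate notBefore so peers with skewed clocks still accept the cert."""
    return issue_time - BACKDATE, issue_time + lifetime

def is_valid_for_peer(peer_clock, window):
    not_before, not_after = window
    return not_before <= peer_clock <= not_after

issue = datetime(2026, 2, 6, 12, 0, tzinfo=timezone.utc)
window = cert_validity_window(issue)
# A PTF container whose clock lags 20 hours behind still accepts the cert:
print(is_valid_for_peer(issue - timedelta(hours=20), window))  # True
# Without backdating (notBefore == issue time) it would be rejected:
print(issue - timedelta(hours=20) >= issue)  # False
```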
What is the motivation for this PR?
Test DUTs with 64 ports connected to peers.

How did you do it?
How did you verify/test it?
Prepare the DUT and verified bgp/test_bgp_fact passes.

Signed-off-by: Vinay Kaza <vinay@nexthop.ai>

* Updated with latest tested topology

Signed-off-by: Vinay Kaza <vinay@nexthop.ai>

* Fix end of file static checker

Signed-off-by: Vinay Kaza <vinay@nexthop.ai>

* Change neighbor VM name from T1 to LT2

Signed-off-by: Vinay Kaza <vinay@nexthop.ai>

---------

Signed-off-by: Vinay Kaza <vinay@nexthop.ai>
…sonic-net#21990)

What is the motivation for this PR?
Running many individual SSH commands in test scripts resulted in excessive overhead due to repeated SSH session establishment, leading to unnecessarily long test times. The total test time was 1 hr 26 mins: the current implementation makes one SSH call per port per queue, so for 450 ports with 7 UC queues each that is roughly 3150 SSH commands. Assuming each SSH call takes 1 to 1.5 seconds, about 1 hr 20 mins of the run is spent fetching this data.

The motivation is to improve test efficiency and make large test runs more practical by reducing SSH connection overhead.

How did you do it?
Identified places in the test where SSH commands are executed in sequence or within loops.
Rewrote those sections to batch multiple commands together and execute them through a single SSH session, fetching show queue counters for all ports and queue together.
Ensured that command outputs and error handling remained compatible with existing test validations.

How did you verify/test it?
Ran affected test scripts before and after the changes to compare results and execution time.
Verified that all expected outputs and pass/fail criteria remain the same.
Monitored for regressions, failures, or unexpected test behaviors possibly introduced by batching commands.
The execution time dropped from 1 hr 26 mins to 3 mins.

Signed-off-by: Priyansh Tratiya <ptratiya@microsoft.com>
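The batching idea can be sketched as joining the per-port commands into one compound command executed over a single session. The helper below is hypothetical and is shown with a fake SSH runner in place of the real device connection:

```python
def fetch_queue_counters_batched(run_cmd, ports):
    """Join per-port commands into one compound command for a single SSH session."""
    combined = " ; ".join(f"show queue counters {port}" for port in ports)
    return run_cmd(combined)   # one session instead of one per (port, queue)

# Recording stub standing in for an SSH command runner:
calls = []
def fake_ssh(cmd):
    calls.append(cmd)
    return "ok"

fetch_queue_counters_batched(fake_ssh, [f"Ethernet{i}" for i in range(450)])
print(len(calls))  # 1 SSH invocation instead of ~3150
```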
)

Summary:
Replace fixed sleeps with polling and reduce wait times:

Add polling helpers: wait_for_crm_counter_update(), wait_for_resource_stabilization()
Replace 50s resource waits with adaptive polling
Reduce config waits from 10s to 5s (CONFIG_UPDATE_TIME)
Reduce cleanup wait from 50s to 20s (SONIC_RES_CLEANUP_UPDATE_TIME)

Signed-off-by: AntonHryshchuk <antonh@nvidia.com>
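The polling helpers replace fixed sleeps with a wait-until loop. A generic sketch of that pattern follows; the real helpers (wait_for_crm_counter_update(), wait_for_resource_stabilization()) poll DUT state rather than this stand-in condition:

```python
import time

def wait_until(timeout, interval, condition, *args):
    """Poll `condition` until it returns truthy or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if condition(*args):
            return True
        time.sleep(interval)
    return False

# Stand-in for polling a CRM counter on the DUT; becomes true on the 3rd poll.
counter = {"polls": 0}
def crm_counter_updated():
    counter["polls"] += 1
    return counter["polls"] >= 3

print(wait_until(5, 0.01, crm_counter_updated))  # True
```

The win over a fixed 50-second sleep is that the wait ends as soon as the condition holds, while the timeout still bounds the worst case.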
…le topos. (sonic-net#22169)

What is the motivation for this PR?
The generic hash test run time depends on the number of egress ports in the topology. When running on the scale topology like t0-isolated-d256u256s2, it takes more than 6 hours. This change is to optimize the runtime. The regular topologies also benefit from this.

How did you do it?
Decrease the total number of sent packets based on the number of egress ports, and relax the threshold for the balancing check accordingly.
Skip the LAG-related cases on topologies which have no LAGs.
Skip some test cases when there are not enough test ports or the variations in the tested field are too few. For example, when the hash field is IP_PROTOCOL there are only 256 possible values, which is not enough to hash to all 256 next hops perfectly.
Improve the test result output.
Add an ASIC DB route check in the reboot test case. After reboot/reload it takes time for the default route to be installed in the ASIC, so checking only the kernel route is not enough on scale topos.

How did you verify/test it?
Run the test on SN5640 with t0-isolated-d256u256s2 topo. The total runtime is about 1.5 ~ 2 hours now.

Signed-off-by: Cong Hou <congh@nvidia.com>
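The scaling described above (fewer packets and a looser balance threshold as the fan-out grows) might look roughly like this; the numbers and the scaling rule are hypothetical illustrations, not the PR's actual formula:

```python
def scaled_traffic_params(n_egress_ports, pkts_per_port=60, base_tolerance=0.25):
    """Scale total packets with the fan-out; widen the per-port balance tolerance."""
    total_pkts = n_egress_ports * pkts_per_port
    expected_per_port = total_pkts / n_egress_ports
    # Hypothetical rule: the more egress ports, the looser the threshold.
    tolerance = min(0.9, base_tolerance * (1 + n_egress_ports / 256))
    return total_pkts, expected_per_port, tolerance

def is_balanced(counts, expected, tolerance):
    return all(abs(c - expected) <= expected * tolerance for c in counts)

total, expected, tol = scaled_traffic_params(256)
print(total, expected, round(tol, 2))               # 15360 60.0 0.5
print(is_balanced([58, 62, 45, 75], expected, tol))  # True
```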
…onic-net#22287)

Description of PR
Changed container_upgrade tests to support restapi containers (restapi, restapi_watchdog, and restapi_sidecar).

Summary:
Microsoft ADO ID: 36672179

Approach
What is the motivation for this PR?
Adding support for restapi containers in container_upgrade tests so that we can run restapi tests using the same container images on different OS versions.

How did you do it?
Added instructions for creating restapi containers and running restapi tests in container_upgrade.

How did you verify/test it?
Tested on a T1 switch running 202505. The restapi tests passed.
What is the motivation for this PR?
Fix test_bgp_suppress_fib.py flakiness on scale topos. Extension of BULK_TRAFFIC_WAIT_TIME and ignoring malformed BGP route packets allows these tests to pass consistently on Arista t1 isolated DUTs

How did you do it?
How did you verify/test it?
Tested on T0 and T1 isolated Arista DUTs
* Fix AttributeError in conftest.py

In the ptfhosts pytest fixture in conftest.py, a method "_hosts.apend"
appears, which is a typo corrected by this change to "_hosts.append".

Signed-off-by: Christopher Croy <ccroy@arista.com>

* Fix flake8 lint indentation error

Signed-off-by: Christopher Croy <ccroy@arista.com>

---------

Signed-off-by: Christopher Croy <ccroy@arista.com>
…ic-net#22201)

orch_northbond_route_zmq_enabled was enabled in:
sonic-net#20441
which enables fpmsyncd to send route events to orchagent via a ZMQ
channel instead of Redis.
test_bgp_route_with_suppress_negative_operation kills orchagent and
restarts bgp session. After restart, fpmsyncd tries to send all updates
to orchagent via ZMQ, but orchagent is still down. Eventually, ZMQ
buffer is filled up and causes the error.
Ignore this error log as it is expected.

Signed-off-by: markxiao <markxiao@arista.com>
Signed-off-by: Austin Pham <austinpham@microsoft.com>
…c-net#21561)

What is the motivation for this PR?
Expand the current NTP test case to verify that changing the source interface configuration for NTP works correctly.

How did you do it?
Add a test case to change the NTP source interface to Loopback0, and then back to eth0.

How did you verify/test it?
Tested on T0 KVM.

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
What is the motivation for this PR?
Add a test case to verify the ability to add and remove DNS nameservers via incremental config.

How did you do it?
Model this after the NTP GCU test case, and add two random DNS nameservers and remove one.

How did you verify/test it?
Tested locally on KVM T0.

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
lakshmi-nexthop and others added 9 commits February 9, 2026 11:55
…System (sonic-net#20872)

What is the motivation for this PR?
sonic-mgmt tests need to be improved for FS voq
How did you do it?
access to the Chassis DB is done only if the VOQ device is a chassis
skipped test_po_voq.py for single-ASIC FS, as this test checks for additions/deletions/updates of LAGs across all ASICs in a chassis using the chassis APP DB
How did you verify/test it?
ran the tests on FS voq

Signed-off-by: Lakshmi Yarramaneni <lakshmi@nexthop.ai>
Co-authored-by: Xin Wang <xiwang5@microsoft.com>
* ISS-2888:Fix JSON syntax in golden_config_db_t2.j2 template (sonic-net#401)

### Description of PR

Summary:
Fixes # (issue)
Fixes below json syntax error. It is seen only when dut is prepared with
macsec enable flag.
json.decoder.JSONDecodeError: Expecting property name enclosed in double
quotes: line 2 column 3 (char 4)
### Type of change


- [x] Bug fix
- [ ] Testbed and Framework(new/improvement)
- [ ] New Test case
    - [ ] Skipped for non-supported platforms
- [ ] Test case improvement

### Back port request
- [ ] 202205
- [ ] 202305
- [ ] 202311
- [ ] 202405
- [ ] 202411
- [ ] 202505

### Approach
#### What is the motivation for this PR?

#### How did you do it?

#### How did you verify/test it?

#### Any platform specific information?

#### Supported testbed topology if it's a new test case?

### Documentation

Signed-off-by: rajshekhar <rajshekhar@nexthop.ai>

* ISS-2969:Generate golden config only if macsec_profile is defined (sonic-net#420)

### Description of PR
Redundant override config is avoided as no macsec profile is set in the
prepare phase.
Below are details how macsec profile configurations are rendered:
PREPARE phase: Uses generate_t2_golden_config_db() → template rendering
→ file-based config
RUN phase: Uses set_macsec_profile() → direct sonic-db-cli commands →
immediate CONFIG_DB update

Summary:
Fixes # (issue)

### Type of change


- [x] Bug fix
- [ ] Testbed and Framework(new/improvement)
- [ ] New Test case
    - [ ] Skipped for non-supported platforms
- [ ] Test case improvement

### Back port request
- [ ] 202205
- [ ] 202305
- [ ] 202311
- [ ] 202405
- [ ] 202411
- [ ] 202505

### Approach
#### What is the motivation for this PR?

#### How did you do it?

#### How did you verify/test it?

#### Any platform specific information?

#### Supported testbed topology if it's a new test case?

### Documentation

Signed-off-by: rajshekhar <rajshekhar@nexthop.ai>

* ISS-3251: Guard MACsec restart against systemd StartLimitHit; add restart helper (sonic-net#562)

### Description of PR


Summary:
• Add a StartLimitHit-safe restart helper and use it in the MACsec docker restart test to reduce flakiness
• New helper restart_service_with_startlimit_guard() in tests/common/helpers/dut_utils.py:
  • Proactively clears systemd failure counters (systemctl reset-failed)
  • Attempts restart, detects systemd rate limiting (StartLimitHit), applies bounded backoff (default 35s), then start
  • Verifies the target container becomes running within a timeout
• Update tests/macsec/test_docker_restart.py to use the new helper instead of duthost.restart_service("macsec")

Fixes # (issue)
• MACsec docker restart tests can intermittently fail due to systemd rate limiting after repeated restarts during teardown/restart cycles.
• Guarding against StartLimitHit with a clear backoff-and-start flow improves test reliability without changing device behavior.

### Type of change


- [ ] Bug fix
- [ ] Testbed and Framework(new/improvement)
- [ ] New Test case
    - [ ] Skipped for non-supported platforms
- [x] Test case improvement

### Back port request
- [ ] 202205
- [ ] 202305
- [ ] 202311
- [ ] 202405
- [ ] 202411
- [ ] 202505

### Approach
#### What is the motivation for this PR?
• MACsec docker restart tests can intermittently fail when systemd enforces StartLimitHit due to rapid restart attempts during teardown/restart cycles.
• This PR makes the restart path resilient to StartLimitHit by proactively clearing counters, applying bounded backoff, and verifying the container reaches the running state, thereby reducing test flakiness.

#### How did you do it?
• Added a helper restart_service_with_startlimit_guard() in tests/common/helpers/dut_utils.py that:
  • Detects StartLimitHit pre/post restart attempts
  • Runs systemctl reset-failed to clear counters
  • Applies a fixed backoff when rate-limited, then systemctl start
  • Verifies the container is running within a configurable timeout using existing wait_until/state checks
• Updated tests/macsec/test_docker_restart.py to use the helper instead of a direct duthost.restart_service("macsec") call.

#### How did you verify/test it?
• Local validation in lab:
  • Executed tests/macsec/test_docker_restart.py::test_restart_macsec_docker with MACsec enabled.
  • Repeated the restart sequence to emulate rate limiting scenarios.
  • Verified the helper reliably recovers from StartLimitHit and the container becomes running within the timeout.

#### Any platform specific information?

#### Supported testbed topology if it's a new test case?

### Documentation

Signed-off-by: rajshekhar <rajshekhar@nexthop.ai>
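The guarded-restart flow above can be sketched as follows. The signature is illustrative (the real helper is restart_service_with_startlimit_guard() in tests/common/helpers/dut_utils.py); a recording stub stands in for commands executed on the DUT:

```python
import time

def restart_service_with_guard(service, run, is_rate_limited,
                               backoff=0.0, sleep=time.sleep):
    """Sketch of the StartLimitHit-safe restart flow (illustrative names)."""
    run(f"systemctl reset-failed {service}")   # proactively clear failure counters
    run(f"systemctl restart {service}")
    if is_rate_limited():                      # e.g. StartLimitHit in systemctl status
        sleep(backoff)                         # bounded backoff before retrying
        run(f"systemctl start {service}")

# Exercise the flow with a recording stub instead of a real DUT:
log = []
restart_service_with_guard("macsec", log.append, lambda: True)
print(log)
# ['systemctl reset-failed macsec', 'systemctl restart macsec', 'systemctl start macsec']
```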

* NOS-3311: Fix MACsec test race and cleanup sync (sonic-net#678)

NOS-3311 tracks MACsec test flakiness caused by races between:

* wpa_supplicant/MKA programming MACsec state into Redis (APPL/STATE
DB), and
* the test harness eagerly reading that state to build `MACSEC_INFO`
(via `get_macsec_attr`).

This can manifest as exceptions like `KeyError('sak')` when the MACsec
egress SA row does not yet exist, even though `MACSEC_PORT_TABLE`
already shows `enable_encrypt="true"`. There are also cleanup races
where tests check for removal of MACsec DB entries before the background
cleanup logic has finished.

This PR adds two pieces of synchronization in sonic-mgmt:

1. Ensure MKA establishment before pre-loading MACsec session info for
tests
2. Provide a helper to wait for MACsec DB cleanup after disabling MACsec

File: `tests/common/macsec/__init__.py`

* The `load_macsec_info` fixture (module-scoped, autouse) previously
called `load_all_macsec_info()` immediately when MACsec was enabled and
a profile was present. That in turn calls `get_macsec_attr()`, which
expects APP/STATE DB MACsec SC/SA entries (including `sak`) to be fully
programmed.
* In environments where MACsec is pre-configured before tests start,
this created a race: `MACSEC_PORT_TABLE` might already exist (with
`enable_encrypt="true"`), but the egress SA row for the active AN might
not yet have been written to APP_DB, leading to `KeyError('sak')` when
`macsec_sa["sak"]` is accessed.
* Fix:
* When MACsec is enabled and a profile is present, the fixture now first
*attempts* to resolve the `wait_mka_establish` fixture:

    ```python
    try:
        request.getfixturevalue('wait_mka_establish')
    except Exception:
        pass
    ```

* `wait_mka_establish` is defined in `tests/macsec/conftest.py` and
internally uses `check_appl_db` plus `wait_until(...)` to ensure
APP/STATE DB MACsec SC/SA tables are populated (including
`sak`/`auth_key`/PN relationships) before returning.
* If the fixture is not defined (e.g., in other environments or test
suites), the code falls back to the previous behavior.
* After this synchronization point, if `is_macsec_configured(...)` is
true, `load_all_macsec_info()` is called to populate `MACSEC_INFO` for
all control links. Otherwise, the original `macsec_setup` flow is
triggered.

This makes `get_macsec_attr()` execution order consistent with the rest
of the MACsec test suite, which already relies on
`wait_mka_establish`/`check_appl_db` to guarantee that egress SAs and
SAKs exist before validating state.

cleanup

File: `tests/common/macsec/macsec_config_helper.py`

* Add `wait_for_macsec_cleanup(host, interfaces, timeout=90)` and export
it via `__all__`.
* This helper is designed for tests that:
  * disable MACsec on one or more interfaces, and then
* need to assert that all associated MACsec entries (port, SC, SA) have
been automatically removed from Redis before proceeding.
* Behavior:
* For EOS neighbors, it is a no-op: they do not use Redis DBs and the
function returns `True` immediately.
  * For SONiC hosts, it:
* Polls both `APPL_DB` and `STATE_DB` using `redis_get_keys_all_asics`
with patterns `MACSEC_*:{interface}*` (APPL_DB) and
`MACSEC_*|{interface}*` (STATE_DB).
    * Aggregates any remaining keys per DB.
* Returns `True` as soon as all such keys are gone for the given
interfaces, logging total time taken.
* If the `timeout` is exceeded, logs a warning, prints a summary of
remaining entries, and returns `False`.
* This centralizes the logic for “wait until MACsec entries are gone
from Redis” instead of having ad hoc sleeps or partial checks in
individual tests.

* MACsec control-plane actions (via wpa_supplicant and swss/macsecorch)
are asynchronous relative to the tests. It is valid for
`MACSEC_PORT_TABLE` to show `enable_encrypt="true"` while transmit SAs
and their SAKs are still being programmed.
* `get_macsec_attr()` assumes that:
* APP_DB `MACSEC_EGRESS_SC_TABLE` for `(port, sci)` exists and has a
valid `encoding_an`, and
* APP_DB `MACSEC_EGRESS_SA_TABLE` for `(port, sci, an)` exists and has a
`sak` field.
Without synchronization, tests that pre-load `MACSEC_INFO` can hit a
window where the SA row does not yet exist and crash with
`KeyError('sak')`.
* By tying `load_macsec_info` to `wait_mka_establish` where available,
we ensure those pre-loads happen only after the expected MACsec state
has been fully written to Redis.
* Similarly, when disabling MACsec, asynchronous background cleanup can
lag behind the test’s expectations. Having a dedicated, reusable
`wait_for_macsec_cleanup` helper lets future tests explicitly wait for
cleanup completion instead of guessing with sleeps.

* Verified that the new fixtures and helpers are imported and wired correctly:
  * `load_macsec_info` remains `autouse=True` at module scope, so existing MACsec tests automatically benefit from the additional synchronization.
  * `wait_for_macsec_cleanup` is exported in `__all__` for use by future MACsec tests.
* Manually exercised MACsec configuration and teardown flows in a MACsec-enabled testbed (e.g., humm120) to confirm:
  * MACsec sessions establish successfully and APP/STATE DB contain expected MACsec entries before `load_all_macsec_info` is invoked.
  * Disabling MACsec followed by `wait_for_macsec_cleanup` results in all MACSEC_* keys being removed from APPL/STATE DB within the timeout window.

---
Pull Request opened by [Augment Code](https://www.augmentcode.com/) with
guidance from the PR author

Signed-off-by: rajshekhar <rajshekhar@nexthop.ai>

* taking care of review comments

- Refine restart_service_with_startlimit_guard to better handle pre-existing StartLimitHit, avoid unnecessary restarts, and apply a shorter backoff when not actually rate-limited.

- Narrow the exception in MacsecPlugin to pytest.FixtureLookupError so we only fall back when the wait_mka_establish fixture is truly missing.

- Make wait_for_macsec_cleanup more flexible by using a dynamic poll interval and relying on its default timeout from callers.

Signed-off-by: rajshekhar <rajshekhar@nexthop.ai>

---------

Signed-off-by: rajshekhar <rajshekhar@nexthop.ai>
…r handling (sonic-net#22082)

Description of PR
Summary: Fixes ethernet interface test failures due to port availability issues and routeCheck errors on multi-ASIC platforms

The below PR is also needed to fix the test completely:
PR for FEC related issue - sonic-net#21026

Type of change
- Bug fix
- Testbed and Framework(new/improvement)
- New Test case
  - Skipped for non-supported platforms
- Test case improvement

What is the motivation for this PR?
Ethernet interface tests would fail when all ethernet ports were members of PortChannels and no free ports were available for testing. Additionally, tests were failing during teardown with routeCheck errors, particularly on multi-ASIC platforms, where port configuration changes (speed, FEC, admin status) triggered temporary routing inconsistencies that were flagged by the routeCheck monitoring service.

How did you do it?
Enhanced port availability logic:
- Added get_test_port() to automatically remove a port from its PortChannel if no free ports are available
- Added remove_port_from_portchannel() helper function
- Port removal is temporary - the ensure_dut_readiness fixture's rollback mechanism automatically restores the original configuration
- Added MTU test protection: checks for free ports without removing them from their PortChannel (remove_from_portchannel=False), and skips the test if none are available, to avoid routing convergence issues

Added a loganalyzer fixture (ignore_expected_loganalyzer_exceptions_lag) to suppress expected routeCheck errors that occur during port configuration changes. The fixture ignores:
- orchagent port speed errors
- routeCheck temporary failures with unaccounted routes
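The ignore mechanism works by matching log lines against a list of regular expressions; a minimal sketch of that matching, with illustrative placeholder patterns rather than the fixture's exact expressions:

```python
import re

# Illustrative ignore patterns; the PR's fixture extends the loganalyzer's
# ignore list with expressions covering orchagent port-speed errors and
# transient routeCheck failures. These exact regexes are assumptions.
IGNORE_REGEXES = [
    r".*ERR swss#orchagent: .*Failed to set port .* speed.*",
    r".*ERR .*routeCheck.* (unaccounted|mismatch).*",
]


def is_expected(log_line):
    """Return True if the line matches an expected (ignorable) error."""
    return any(re.search(pattern, log_line) for pattern in IGNORE_REGEXES)
```

Any ERR line matching one of these patterns is excluded from the loganalyzer's failure summary, so the temporary errors during port reconfiguration no longer fail the test.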
How did you verify/test it?
- Ran all ethernet interface tests on multi-ASIC linecard platforms where all ports were in PortChannels
- Verified that tests successfully complete without routeCheck failures during teardown
- Confirmed that port removal and restoration work correctly via the rollback mechanism

Signed-off-by: Nanma Purushotam <gupurush@gmail.com>

* Trigger CI checks

Signed-off-by: Nanma Purushotam <gupurush@gmail.com>

* Updated conftest and removed the test-based loganalyzer

Signed-off-by: Nanma Purushotam <gupurush@gmail.com>
Fix incorrect fallback value for serial device prefix in _get_serial_device_prefix(). Changed from /dev/ttyUSB- to /dev/ttyUSB to generate correct device paths (e.g., /dev/ttyUSB1 instead of /dev/ttyUSB-1).

What is the motivation for this PR?
When udevprefix.conf cannot be read, the fallback value /dev/ttyUSB- generates invalid device paths.

How did you do it?
Changed fallback from /dev/ttyUSB- to /dev/ttyUSB and updated related docstrings/comments.
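The effect of the fallback value is easy to see, since device paths are built by direct concatenation of prefix and index; a minimal illustration (the function here is a simplified sketch, not the code in `_get_serial_device_prefix()` itself):

```python
def build_serial_device_path(prefix, index):
    # Device paths are formed by direct concatenation, so the fallback
    # prefix must not carry a trailing separator.
    return "{}{}".format(prefix, index)

# Old fallback: "/dev/ttyUSB-" -> "/dev/ttyUSB-1" (invalid path)
# New fallback: "/dev/ttyUSB"  -> "/dev/ttyUSB1"  (valid path)
```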

How did you verify/test it?
Code review.

Signed-off-by: cliffchen <t-cliffchen+github@microsoft.com>
What is the motivation for this PR?
Enable test_po_update with ipv6-only topo

How did you do it?
Added steps for ipv6

How did you verify/test it?
Ran on ipv6-only topo testbed
What is the motivation for this PR?
To enable ipv6 only topo for test_static_route

How did you do it?
Added a check for ipv6 and used the appropriate prefix for it.

How did you verify/test it?
Tested on an ipv6-only testbed to verify
…pv6-default] since it is duplicated with skip condition (sonic-net#21759)

What is the motivation for this PR?
Remove the duplicated xfail, since it should already be covered by the skip condition.
This makes the conditions clearer, with less confusion and less maintenance.
What is the motivation for this PR?
The previous test_ipinip teardown could leave the dut in an unhealthy state after the test finishes, causing several docker containers to remain down and impacting subsequent tests.

How did you do it?
Updated the test_ipinip teardown logic to restore the dut by performing a golden config reload.

How did you verify/test it?
Ran test_ipinip and passed.
DPU config can be deployed after the switch configuration.

Minigraph deployment command has been extended:
* Deploy minigraph for both Switch and DPU:
  ./testbed-cli.sh deploy-mg {{SWITCH}}-{{TOPO}} lab vault -vvvvv
* Deploy minigraph for Switch only:
  ./testbed-cli.sh deploy-mg {{SWITCH}}-{{TOPO}} lab vault -vvvvv --skip-tags "dpu_config"
* Deploy minigraph for DPU only:
  ./testbed-cli.sh deploy-mg {{SWITCH}}-{{TOPO}} lab vault -vvvvv --tags "dpu_config"

Signed-off-by: nmirin <nikolay.a.mirin@gmail.com>
@lolyu lolyu force-pushed the parallel_manager2 branch from 42e170d to 70897f8 Compare February 10, 2026 06:03
lolyu added 18 commits February 10, 2026 06:18
Signed-off-by: Longxiang <lolv@microsoft.com>
Signed-off-by: Longxiang <lolv@microsoft.com>
Signed-off-by: Longxiang <lolv@microsoft.com>
Signed-off-by: Longxiang <lolv@microsoft.com>
Signed-off-by: Longxiang <lolv@microsoft.com>
Signed-off-by: Longxiang <lolv@microsoft.com>
Signed-off-by: Longxiang <lolv@microsoft.com>
Signed-off-by: Longxiang Lyu <lolv@microsoft.com>
Signed-off-by: Longxiang Lyu <lolv@microsoft.com>
Signed-off-by: Longxiang Lyu <lolv@microsoft.com>
Signed-off-by: Longxiang Lyu <lolv@microsoft.com>
Signed-off-by: Longxiang Lyu <lolv@microsoft.com>
Signed-off-by: Longxiang Lyu <lolv@microsoft.com>
Signed-off-by: Longxiang Lyu <lolv@microsoft.com>
Signed-off-by: Longxiang Lyu <lolv@microsoft.com>
Signed-off-by: Longxiang Lyu <lolv@microsoft.com>
Signed-off-by: Longxiang Lyu <lolv@microsoft.com>
Signed-off-by: Longxiang Lyu <lolv@microsoft.com>
@lolyu lolyu force-pushed the parallel_manager2 branch from 70897f8 to 9f0c4af Compare February 10, 2026 06:30